File-Based Storage of Digital Objects and Constituent Datastreams: XMLtapes and Internet Archive ARC Files
Abstract
This paper introduces the write-once/read-many XMLtape/ARC storage approach for Digital Objects and their constituent datastreams. The approach combines two interconnected file-based storage mechanisms, both accessible through standard protocols. First, XML-based representations of multiple Digital Objects are concatenated into a single file called an XMLtape. An XMLtape is a valid XML file, and its format definition is independent of the XML-based complex object format chosen to represent the Digital Objects. Indexes on the identifier and the creation datetime of each XML-based representation enable OAI-PMH-based access to the Digital Objects stored in an XMLtape. Second, ARC files, as introduced by the Internet Archive, hold the constituent datastreams of the Digital Objects in concatenated form. An index on the datastream identifier enables OpenURL-based access to an ARC file. The two mechanisms are interconnected in two ways: the identifiers of the ARC files associated with an XMLtape are recorded as administrative information in the XMLtape, and the XML-based representation of each Digital Object includes OpenURL references to its constituent datastreams.
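The indexing idea in the abstract can be illustrated with a minimal Python sketch. It is not the XMLtape format itself: the element names (`tape`, `admin`, `arc`, `record`), the function names, and the in-memory index are assumptions made only for illustration. The sketch concatenates XML fragments into one well-formed XML file, lists the associated ARC file identifiers in an administrative section, and keeps a byte-offset index keyed by identifier and datestamp, so that a single record can be read back without parsing the whole file.

```python
import io
from xml.sax.saxutils import escape


def write_tape(records, arc_ids):
    """Concatenate XML fragments into one tape and build an index.

    records: iterable of (identifier, datestamp, xml_fragment) tuples.
    arc_ids: identifiers of the ARC files associated with this tape.
    Returns (tape_bytes, index), where index maps
    identifier -> (datestamp, byte_offset, byte_length).
    All element names here are hypothetical, not the real XMLtape schema.
    """
    buf = io.BytesIO()
    buf.write(b'<?xml version="1.0" encoding="UTF-8"?>\n<tape>\n')

    # Administrative section naming the associated ARC files.
    admin = "".join(f"<arc>{escape(a)}</arc>" for a in arc_ids)
    buf.write(f"<admin>{admin}</admin>\n".encode("utf-8"))

    index = {}
    for identifier, datestamp, fragment in records:
        data = fragment.encode("utf-8")
        buf.write(b"<record>")
        offset = buf.tell()  # byte position where the fragment starts
        buf.write(data)
        buf.write(b"</record>\n")
        index[identifier] = (datestamp, offset, len(data))

    buf.write(b"</tape>\n")
    return buf.getvalue(), index


def read_record(tape, index, identifier):
    """Fetch one record by identifier using only its offset and length."""
    _, offset, length = index[identifier]
    return tape[offset:offset + length].decode("utf-8")


def list_identifiers(index, from_date, until_date):
    """Return identifiers whose datestamp lies in [from_date, until_date].

    Assumes all datestamps are ISO 8601 strings in the same format,
    so plain string comparison orders them chronologically.
    """
    return sorted(
        ident for ident, (stamp, _, _) in index.items()
        if from_date <= stamp <= until_date
    )
```

Lookup by identifier and selection by date range correspond loosely to the kinds of requests an OAI-PMH interface has to answer. A real deployment would persist the index rather than hold it in a dictionary.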
Similar resources
arXiv:cs.DL/0503016v2, 3 Jun 2005. File-based storage of Digital Objects and constituent datastreams: XMLtapes and Internet Archive ARC files
This paper introduces the write-once/read-many XMLtape/ARC storage approach for Digital Objects and their constituent datastreams. The approach combines two interconnected file-based storage mechanisms that are made accessible in a protocol-based manner. First, XML-based representations of multiple Digital Objects are concatenated into a single file named an XMLtape. An XMLtape is a valid XML fi...
aDORe: A Modular, Standards-Based Digital Object Repository
This paper describes the aDORe repository architecture designed and implemented for ingesting, storing, and accessing a vast collection of Digital Objects at the Research Library of the Los Alamos National Laboratory. The aDORe architecture is highly modular and standards-based. In the architecture, the MPEG-21 Digital Item Declaration Language is used as the XML-based format to represent Digit...
Migrating Content in WARC Files
Heritage institutions all over the world have started harvesting and preserving resources of the World Wide Web for future generations as part of our cultural heritage. This task is non-trivial because of two complex challenges: (1) crawling the enormous amount of data on the Internet and (2) applying long-term preservation strategies to these data. Nowadays a lot of effort i...
Studies on the scalability of web preservation
This paper describes a mechanism for improving the scalability of preservation actions on large linked archives, such as WARC and ARC files produced from the archiving of web sites. To enable accurate but efficient preservation actions, information on the files embedded within a container object, such as the file formats of the embedded files, are aggregated and recorded as properties of the co...
Repository and Preservation Storage Architecture
While the Open Archive Information System (OAIS) model has become the de facto standard for preservation archives, the design and implementation of a repository or reliable long term archive lacks adopted technology standards and design best practices. This paper is intended to provide guidelines and recommendations for standards implementation and best practices for a viable, cost effective, a...